Credit Card Users Churn Prediction¶

Description¶

The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Objective¶

Customers’ leaving credit card services would lead the bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and the reason for same – so that the bank could improve upon those areas.

You as a Data Scientist at Thera Bank need to explore the data provided, identify patterns, and come up with a classification model to identify customers likely to churn, and provide actionable insights and recommendations that will help the bank improve its services so that customers do not renounce their credit cards.

Data Description¶

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag : Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age : Age in Years
  • Gender : The gender of the account holder
  • Dependent_count : Number of dependents
  • Education_Level : Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.
  • Marital_Status : Marital Status of the account holder
  • Income_Category : Annual Income Category of the account holder
  • Card_Category : Type of Card
  • Months_on_book : Period of relationship with the bank
  • Total_Relationship_Count : Total no. of products held by the customer
  • Months_Inactive_12_mon : No. of months inactive in the last 12 months
  • Contacts_Count_12_mon : No. of Contacts between the customer and bank in the last 12 months Credit_Limit : Credit Limit on the Credit Card
  • Total_Revolving_Bal : The balance that carries over from one month to the next is the revolving balance
  • Avg_Open_To_Buy : Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
  • Total_Trans_Amt : Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct : Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1 : Ratio of the total transaction count in 4th quarter and the total transaction count in the 1st quarter
  • Total_Amt_Chng_Q4_Q1 : Ratio of the total transaction amount in 4th quarter and the total transaction amount in the 1st quarter
  • Avg_Utilization_Ratio : Represents how much of the available credit the customer spent

Importing necessary libraries and data¶

In [1]:
#libraries to read and manipulate data
import numpy as np
import pandas as pd

#libraries to visualize data
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Data Overview¶

In [2]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [3]:
data = pd.read_csv("/content/drive/MyDrive/AIML/Featurization_Model_Selection & Tunning/BankChurners.csv")
In [4]:
df = data.copy()

First and last 5 rows of the dataset

In [5]:
# looking at head (first 5 observations)
df.head()
Out[5]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book ... Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 ... 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 ... 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 ... 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 ... 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 ... 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000

5 rows × 21 columns

In [6]:
# looking at tail (last 5 observations)
df.tail()
Out[6]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book ... Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
10122 772366833 Existing Customer 50 M 2 Graduate Single $40K - $60K Blue 40 ... 2 3 4003.0 1851 2152.0 0.703 15476 117 0.857 0.462
10123 710638233 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 ... 2 3 4277.0 2186 2091.0 0.804 8764 69 0.683 0.511
10124 716506083 Attrited Customer 44 F 1 High School Married Less than $40K Blue 36 ... 3 4 5409.0 0 5409.0 0.819 10291 60 0.818 0.000
10125 717406983 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 ... 3 3 5281.0 0 5281.0 0.535 8395 62 0.722 0.000
10126 714337233 Attrited Customer 43 F 2 Graduate Married Less than $40K Silver 25 ... 2 4 10388.0 1961 8427.0 0.703 10294 61 0.649 0.189

5 rows × 21 columns

In [7]:
#Checking the shape of the dataset
df.shape
Out[7]:
(10127, 21)
  • There are 10127 row and 21 column present in dataset.
In [8]:
#Checking the data types of the columns for the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
  • Data set consist of object, int, float data type of features in which 6, 10, 5 are object, int and float data type respectively.
  • Attrition Flag is our target variable.
  • Education level and marital status consist if missing values we need to imput then going forward
In [9]:
#checking for duplicates
df.isna().sum()
Out[9]:
0
CLIENTNUM 0
Attrition_Flag 0
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 1519
Marital_Status 749
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0

  • Education level and marital status consist if missing values we need to imput then going forward
In [10]:
df.describe().T
Out[10]:
count mean std min 25% 50% 75% max
CLIENTNUM 10127.0 7.391776e+08 3.690378e+07 708082083.0 7.130368e+08 7.179264e+08 7.731435e+08 8.283431e+08
Customer_Age 10127.0 4.632596e+01 8.016814e+00 26.0 4.100000e+01 4.600000e+01 5.200000e+01 7.300000e+01
Dependent_count 10127.0 2.346203e+00 1.298908e+00 0.0 1.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00
Months_on_book 10127.0 3.592841e+01 7.986416e+00 13.0 3.100000e+01 3.600000e+01 4.000000e+01 5.600000e+01
Total_Relationship_Count 10127.0 3.812580e+00 1.554408e+00 1.0 3.000000e+00 4.000000e+00 5.000000e+00 6.000000e+00
Months_Inactive_12_mon 10127.0 2.341167e+00 1.010622e+00 0.0 2.000000e+00 2.000000e+00 3.000000e+00 6.000000e+00
Contacts_Count_12_mon 10127.0 2.455317e+00 1.106225e+00 0.0 2.000000e+00 2.000000e+00 3.000000e+00 6.000000e+00
Credit_Limit 10127.0 8.631954e+03 9.088777e+03 1438.3 2.555000e+03 4.549000e+03 1.106750e+04 3.451600e+04
Total_Revolving_Bal 10127.0 1.162814e+03 8.149873e+02 0.0 3.590000e+02 1.276000e+03 1.784000e+03 2.517000e+03
Avg_Open_To_Buy 10127.0 7.469140e+03 9.090685e+03 3.0 1.324500e+03 3.474000e+03 9.859000e+03 3.451600e+04
Total_Amt_Chng_Q4_Q1 10127.0 7.599407e-01 2.192068e-01 0.0 6.310000e-01 7.360000e-01 8.590000e-01 3.397000e+00
Total_Trans_Amt 10127.0 4.404086e+03 3.397129e+03 510.0 2.155500e+03 3.899000e+03 4.741000e+03 1.848400e+04
Total_Trans_Ct 10127.0 6.485869e+01 2.347257e+01 10.0 4.500000e+01 6.700000e+01 8.100000e+01 1.390000e+02
Total_Ct_Chng_Q4_Q1 10127.0 7.122224e-01 2.380861e-01 0.0 5.820000e-01 7.020000e-01 8.180000e-01 3.714000e+00
Avg_Utilization_Ratio 10127.0 2.748936e-01 2.756915e-01 0.0 2.300000e-02 1.760000e-01 5.030000e-01 9.990000e-01
  • Mean and median customer age is 46 years, customer having credit card of 26 years minimum and 73 years of maximum age
  • Dependent Count column has mean and median of around 2
  • Months on Book column has mean and median of 36 months. Minimum value is 13 months, showing that the dataset captures data for customers with the bank at least 1 whole years
  • Total Relationship Count has mean and median of ~4
  • The customer which offered hightest credit card limit is 34.5K and lowest is 1438.
  • Total Transaction Count has mean of ~65 and median of 67
In [11]:
df.describe(include = 'object').T
Out[11]:
count unique top freq
Attrition_Flag 10127 2 Existing Customer 8500
Gender 10127 2 F 5358
Education_Level 8608 6 Graduate 3128
Marital_Status 9378 3 Married 4687
Income_Category 10127 6 Less than $40K 3561
Card_Category 10127 4 Blue 9436
  • Attrition flag, gender having 2 unique values.
  • Education level and income category is divided into 6 departments.
  • There are 4 types of credit card issued to customer.
In [12]:
df.select_dtypes(include = 'object').nunique()
Out[12]:
0
Attrition_Flag 2
Gender 2
Education_Level 6
Marital_Status 3
Income_Category 6
Card_Category 4

In [13]:
#Dropping the client ID column
df.drop(columns = 'CLIENTNUM', inplace = True)
In [14]:
object_col = df.select_dtypes(include='object').columns.tolist()
df[object_col] = df[object_col].astype('category')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64   
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  int64   
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           10127 non-null  category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64   
 9   Total_Relationship_Count  10127 non-null  int64   
 10  Months_Inactive_12_mon    10127 non-null  int64   
 11  Contacts_Count_12_mon     10127 non-null  int64   
 12  Credit_Limit              10127 non-null  float64 
 13  Total_Revolving_Bal       10127 non-null  int64   
 14  Avg_Open_To_Buy           10127 non-null  float64 
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64 
 16  Total_Trans_Amt           10127 non-null  int64   
 17  Total_Trans_Ct            10127 non-null  int64   
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64 
 19  Avg_Utilization_Ratio     10127 non-null  float64 
dtypes: category(6), float64(5), int64(9)
memory usage: 1.1 MB
  • Dataset info after dropping customer ID feature

Exploratory Data Analysis¶

In [15]:
def boxhist_plot(data, column, figsize = (16,6)):
  """
    data: dataframe
    column: column name
    figsize: size of figure
  """

  fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=figsize) # Create a figure and a set of subplots

  sns.boxplot(data[column], ax=ax1) # Plot the boxplot on the first subplot (ax1)

  sns.histplot(data[column], ax=ax2) # Plot the histogram on the second subplot (ax2)

  sns.kdeplot(data[column], ax=ax3) # Plot the kernel density estimate on the third subplot (ax3)

  sns.violinplot(data[column], ax=ax4) # Plot the violin plot on the fourth subplot (ax4)

  ax2.axvline(
      data[column].median(),
      color='green',
      linestyle='dashed',
      linewidth=2,
      label='Mean'
  )

  ax2.axvline(
      data[column].mean(),
      color='red',
      linestyle='dashed',
      linewidth=2,
      label='Mean'
  )

  ax1.set_title(f'Boxplot of {column}') # Set the title of the first subplot
  ax2.set_title(f'Histogram of {column}') # Set the title of the second subplot

  plt.show() # Display the plot
In [16]:
# function to create labeled barplots

def labeled_barplot(data, feature, feature_2, order, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    feature_2: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette='coolwarm',
        order=order,
        hue=feature_2,
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=9,
            xytext=(0, 5),
            textcoords="offset points"
        )  # annotate the percentage

    plt.show()  # show the plot
In [17]:
boxhist_plot(df, 'Customer_Age')
  • Credit card user customer age start from 26 to 73 years old.
  • Customer age mean and median are at same position it shows normal distrubution of graph.
In [18]:
boxhist_plot(df, 'Dependent_count')
  • There are maximum number of dependent is 5 and average 2 family members are dependable on credit card holder.
In [19]:
boxhist_plot(df, 'Months_on_book')
  • Loyal customer stay around 35 months with the offered credit card
  • Some customer only take credit card for 10 month of use
  • The average use of credit card is around 35 months from customer
In [20]:
boxhist_plot(df, 'Total_Relationship_Count')
  • Some customer hold maximum 7 products offered by bank
  • Average product held by customer is 3
  • Few of dont like to subscribe for more than 1 product
In [21]:
boxhist_plot(df, 'Months_Inactive_12_mon')
  • Most of credit cards are inactive for 3 months
  • Majorly credit card are not in use upto 3 months
In [22]:
boxhist_plot(df, 'Contacts_Count_12_mon')
  • Bank reach out to customer maximum 6 times per year
  • In last 12 months average of bank contacted to customer is 2.5 times
In [23]:
boxhist_plot(df, 'Credit_Limit')
  • Credit limit varies customer to customer thsts why it skewed towards right
  • Those customer whose having major transation with bank only these are get credit card limit upto 40000.
  • Average credit card limit offered by bank is 8631
In [24]:
boxhist_plot(df, 'Total_Revolving_Bal')
  • The averaged account balance that carry forward to next month is around 1200
In [25]:
boxhist_plot(df, 'Avg_Open_To_Buy')
  • Most of customer left the amount on credit card inbetween 0 to 4500
In [26]:
boxhist_plot(df, 'Total_Trans_Amt')
  • Customer average transation amount between 12 months is 0 to 5000
  • Bigger customer can make transation of more than 15000
In [27]:
boxhist_plot(df, 'Total_Trans_Ct')
In [28]:
boxhist_plot(df, 'Total_Ct_Chng_Q4_Q1')
  • Quatarly customer makes around 500 transation
In [29]:
boxhist_plot(df, 'Total_Amt_Chng_Q4_Q1')
In [30]:
boxhist_plot(df, 'Avg_Utilization_Ratio')
In [31]:
labeled_barplot(df, 'Gender', 'Attrition_Flag', order=df.Gender.value_counts().index)
  • Females are likely to close their account than men
  • Most of females no like to contionue with existing credit card
In [32]:
labeled_barplot(df, 'Education_Level', 'Attrition_Flag', order=df.Education_Level.value_counts().index)
  • Most credit card users are graduate ones
  • Also, maximum number of closed accounts are closed by graduate rather than other
In [33]:
labeled_barplot(df, 'Marital_Status', 'Attrition_Flag', order=df.Marital_Status.value_counts().index)
  • Married customer shows the higher use of credit card
In [34]:
labeled_barplot(df, 'Income_Category', 'Attrition_Flag', order=df.Income_Category.value_counts().index)
  • Higher income category uses the credit cards it also shows the bank only offer credit cards which are financially strong to pay their bills
In [35]:
labeled_barplot(df, 'Card_Category', 'Attrition_Flag', order=df.Card_Category.value_counts().index)
  • Blue card category is common offered card to customer
In [36]:
num_col = df.select_dtypes(include='number').columns.tolist()
num_col
Out[36]:
['Customer_Age',
 'Dependent_count',
 'Months_on_book',
 'Total_Relationship_Count',
 'Months_Inactive_12_mon',
 'Contacts_Count_12_mon',
 'Credit_Limit',
 'Total_Revolving_Bal',
 'Avg_Open_To_Buy',
 'Total_Amt_Chng_Q4_Q1',
 'Total_Trans_Amt',
 'Total_Trans_Ct',
 'Total_Ct_Chng_Q4_Q1',
 'Avg_Utilization_Ratio']
In [53]:
sns.pairplot(df[num_col], diag_kind='kde', corner=True)
Out[53]:
<seaborn.axisgrid.PairGrid at 0x7d3e6b974a60>
  • Customer_Age: The distribution appears to be slightly right-skewed, indicating a higher concentration of younger customers.
  • Credit_Limit, Avg_Open_To_Buy: Both are right-skewed, meaning a few customers have considerably higher limits and open-to-buy amounts than the majority.
  • Total_Trans_Amt, Total_Trans_Ct: These are also right-skewed, showing that a smaller group of customers make significantly more and larger transactions.
  • Total_Trans_Amt & Total_Trans_Ct: There's a strong positive correlation, which is expected as customers with more transactions tend to have higher total transaction amounts.
  • Credit_Limit & Avg_Open_To_Buy: These show a positive correlation, indicating that higher credit limits often result in higher average open-to-buy amounts.
  • Customer_Age & Months_on_book: A slight positive correlation may exist, suggesting that older customers might have longer relationships with the bank.
In [38]:
def boxplot_with_target(data: pd.DataFrame, numeric_columns, target, include_outliers):

    subplot_cols = 2
    subplot_rows = int(len(numeric_columns) / 2 + 1)
    plt.figure(figsize=(16, 3 * subplot_rows))
    for i, col in enumerate(numeric_columns):
        plt.subplot(8, 2, i + 1)
        sns.boxplot(
            data=data,
            x=target,
            y=col,
            orient="vertical",
            palette="Blues",
            showfliers=include_outliers,
        )
        plt.xlabel(target, fontsize=12)
        plt.ylabel(col, fontsize=12)
        plt.xticks(fontsize=12)
        plt.yticks(fontsize=12)
    plt.tight_layout()
    plt.show()
In [39]:
boxplot_with_target(df, num_col, 'Attrition_Flag', include_outliers=True)
  • Attrited Customers tend to have:

  • Lower Total_Trans_Amt and Total_Trans_Ct values. This means they make fewer and smaller transactions compared to Existing Customers.

  • Lower Avg_Utilization_Ratio. They tend to utilize a smaller portion of their available credit.
  • Lower Total_Revolving_Bal. They tend to carry over a lower balance from one month to the next.
  • Higher Total_Ct_Chng_Q4_Q1 and Total_Amt_Chng_Q4_Q1 values. This indicates a larger change in their transaction behavior between the 4th and 1st quarters compared to Existing Customers.
  • Similar Credit_Limit and Months_on_book to Existing Customers.
  • Similar Customer_Age to Existing Customers.
In [40]:
boxplot_with_target(df, num_col, 'Attrition_Flag', include_outliers=False)
In [41]:
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • Total_Trans_Amt and Total_Trans_Ct have a strong positive correlation (0.81). This is expected as more transactions usually lead to a higher total transaction amount.
  • Credit_Limit and Avg_Open_To_Buy show a strong positive correlation (0.99). This makes sense because a higher credit limit typically results in a larger average open-to-buy amount.
  • Avg_Utilization_Ratio and Credit_Limit have a negative correlation (-0.48). This suggests that customers with higher credit limits tend to have lower average utilization ratios, indicating they might be using a smaller portion of their available credit.
  • Total_Trans_Amt and Avg_Utilization_Ratio have a moderate positive correlation (0.36), indicating customers with higher transaction amounts tend to have higher utilization ratios.
  • Total_Revolving_Bal and Avg_Utilization_Ratio have a strong positive correlation (0.62). This implies that a higher revolving balance is associated with a higher utilization ratio, suggesting that customers who carry a higher balance tend to use more of their available credit.

EDA Insights¶

Customer Demographics and Behavior:

  • Customer age ranges from 26 to 73, is normally distributed, and has an average and median around 46 years.
  • Most credit card users are graduates, and they also account for the highest number of account closures.
  • Married customers exhibit higher credit card usage.
  • Females are slightly more likely to close their accounts than males.
  • Higher income categories are associated with increased credit card usage.
  • The Blue card is the most commonly offered card type.
  • Loyal customers stay with the credit card for an average of 35 months.

Transaction Patterns:

  • Total transaction amount and count are positively correlated.
  • Credit limit and average open-to-buy are strongly positively correlated.
  • Average utilization ratio and credit limit have a negative correlation.
  • Total transaction amount and average utilization ratio have a moderate positive correlation.
  • Total revolving balance and average utilization ratio are strongly positively correlated.
  • Customers with lower total transaction amounts, lower utilization ratios, and higher changes in transaction behavior between quarters are more likely to churn.

Data Processing¶

In [42]:
#Outlier Detection

plt.figure(figsize=(10,30))
for i, variable in enumerate(num_col):
  plt.subplot(8,2,i+1)
  sns.boxplot(df[num_col[i]])
  plt.tight_layout()
  plt.title(variable)

plt.show()
In [43]:
X = df.drop(columns='Attrition_Flag', axis=1)
y = df['Attrition_Flag']
In [44]:
X = pd.get_dummies(X, drop_first=True)
In [45]:
X.shape
Out[45]:
(10127, 30)
In [46]:
X.replace(True, 1, inplace = True)
X.replace(False, 0, inplace = True)
In [47]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 30 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Customer_Age                    10127 non-null  int64  
 1   Dependent_count                 10127 non-null  int64  
 2   Months_on_book                  10127 non-null  int64  
 3   Total_Relationship_Count        10127 non-null  int64  
 4   Months_Inactive_12_mon          10127 non-null  int64  
 5   Contacts_Count_12_mon           10127 non-null  int64  
 6   Credit_Limit                    10127 non-null  float64
 7   Total_Revolving_Bal             10127 non-null  int64  
 8   Avg_Open_To_Buy                 10127 non-null  float64
 9   Total_Amt_Chng_Q4_Q1            10127 non-null  float64
 10  Total_Trans_Amt                 10127 non-null  int64  
 11  Total_Trans_Ct                  10127 non-null  int64  
 12  Total_Ct_Chng_Q4_Q1             10127 non-null  float64
 13  Avg_Utilization_Ratio           10127 non-null  float64
 14  Gender_M                        10127 non-null  int64  
 15  Education_Level_Doctorate       10127 non-null  int64  
 16  Education_Level_Graduate        10127 non-null  int64  
 17  Education_Level_High School     10127 non-null  int64  
 18  Education_Level_Post-Graduate   10127 non-null  int64  
 19  Education_Level_Uneducated      10127 non-null  int64  
 20  Marital_Status_Married          10127 non-null  int64  
 21  Marital_Status_Single           10127 non-null  int64  
 22  Income_Category_$40K - $60K     10127 non-null  int64  
 23  Income_Category_$60K - $80K     10127 non-null  int64  
 24  Income_Category_$80K - $120K    10127 non-null  int64  
 25  Income_Category_Less than $40K  10127 non-null  int64  
 26  Income_Category_abc             10127 non-null  int64  
 27  Card_Category_Gold              10127 non-null  int64  
 28  Card_Category_Platinum          10127 non-null  int64  
 29  Card_Category_Silver            10127 non-null  int64  
dtypes: float64(5), int64(25)
memory usage: 2.3 MB
In [48]:
from sklearn.model_selection import train_test_split

# Splitting data into training, validation and test set:
# first we split data into 2 parts, temporary and test

X_temp, X_test, y_temp, y_test = train_test_split( X, y, test_size=0.30, random_state=40, stratify=y )

# then we split the temporary set into train and validation

X_train, X_val, y_train, y_val = train_test_split( X_temp, y_temp, test_size=0.30, random_state=40, stratify=y_temp )
In [49]:
#confirm the shape of both data sets and the ratio of classes is the same across train, validation and test datasets

print("Shape of Training set : ", X_train.shape)
print("Shape of validation set : ", X_val.shape)
print("Shape of test set : ", X_test.shape)
print(' ')

print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print(' ')

print("Percentage of classes in validation set:")
print(y_val.value_counts(normalize=True))
print(' ')

print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (4961, 30)
Shape of validation set :  (2127, 30)
Shape of test set :  (3039, 30)
 
Percentage of classes in training set:
Attrition_Flag
Existing Customer    0.839347
Attrited Customer    0.160653
Name: proportion, dtype: float64
 
Percentage of classes in validation set:
Attrition_Flag
Existing Customer    0.83921
Attrited Customer    0.16079
Name: proportion, dtype: float64
 
Percentage of classes in test set:
Attrition_Flag
Existing Customer    0.839421
Attrited Customer    0.160579
Name: proportion, dtype: float64

Model Building¶

In [57]:
#libraries for metrics and statistics
from sklearn import metrics
import scipy.stats as stats
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score, roc_auc_score


def model_score(model,X_test, y_test):
    '''
    model_score : extract performed model score in single code snippet

    '''
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)

    cm = metrics.confusion_matrix(y_test, y_pred)

    true_positive = cm[0][0]
    false_positive = cm[0][1]
    false_negative = cm[1][0]
    true_negative = cm[1][1]

    Precision = true_positive/(true_positive+false_positive)

    Recall = true_positive/(true_positive+false_negative)

    F1_Score = 2*(Recall * Precision) / (Recall + Precision)

    print("Confusion Matrix: ", cm)
    print("Accuracy Score: ", acc)
    print("Precision Score: ", Precision)
    print("Recall Score: ", Recall)
    print("F1 Score: ", F1_Score)

Original Dataset¶

1. Decision Tree Model

In [55]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
In [58]:
dtree = DecisionTreeClassifier(random_state=40)
dtree.fit(X_train, y_train)
Out[58]:
DecisionTreeClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=40)
In [59]:
print(dtree.score(X_train, y_train))
print(dtree.score(X_val, y_val))
1.0
0.9459332393041843
In [65]:
dtree_modelscore = model_score(dtree, X_test, y_test)
Confusion Matrix:  [[ 392   96]
 [ 104 2447]]
Accuracy Score:  0.9341888779203685
Precision Score:  0.8032786885245902
Recall Score:  0.7903225806451613
F1 Score:  0.7967479674796747

2. Bagging

In [61]:
from sklearn.ensemble import BaggingClassifier
In [62]:
bagging = BaggingClassifier(random_state=40)
bagging.fit(X_train, y_train)
Out[62]:
BaggingClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BaggingClassifier(random_state=40)
In [63]:
print(bagging.score(X_train, y_train))
print(bagging.score(X_val, y_val))
0.9977827050997783
0.9619181946403385
In [66]:
bagging_modelscore = model_score(bagging, X_test, y_test)
Confusion Matrix:  [[ 415   73]
 [  69 2482]]
Accuracy Score:  0.9532741033234616
Precision Score:  0.8504098360655737
Recall Score:  0.8574380165289256
F1 Score:  0.8539094650205761

3. Boosting

3.1 Ada Boost

In [68]:
from sklearn.ensemble import AdaBoostClassifier
In [69]:
adaboost = AdaBoostClassifier(random_state=40)
adaboost.fit(X_train, y_train)
Out[69]:
AdaBoostClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(random_state=40)
In [70]:
print(adaboost.score(X_train, y_train))
print(adaboost.score(X_val, y_val))
0.9621044144325741
0.9576868829337094
In [71]:
adaboost_modelscore = model_score(adaboost, X_test, y_test)
Confusion Matrix:  [[ 404   84]
 [  53 2498]]
Accuracy Score:  0.9549193813754524
Precision Score:  0.8278688524590164
Recall Score:  0.8840262582056893
F1 Score:  0.8550264550264551

3.2 Gradient Boost

In [72]:
from sklearn.ensemble import GradientBoostingClassifier
In [73]:
gradientboost = GradientBoostingClassifier(random_state=40)
gradientboost.fit(X_train, y_train)
Out[73]:
GradientBoostingClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(random_state=40)
In [74]:
print(gradientboost.score(X_train, y_train))
print(gradientboost.score(X_val, y_val))
0.975609756097561
0.9680300893276916
In [75]:
gradientboost_modelscore = model_score(gradientboost, X_test, y_test)
Confusion Matrix:  [[ 395   93]
 [  30 2521]]
Accuracy Score:  0.9595261599210266
Precision Score:  0.8094262295081968
Recall Score:  0.9294117647058824
F1 Score:  0.8652792990142388

4. RandomForest

In [78]:
from sklearn.ensemble import RandomForestClassifier
In [79]:
randomforest = RandomForestClassifier(random_state=40)
randomforest.fit(X_train, y_train)
Out[79]:
RandomForestClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(random_state=40)
In [80]:
print(randomforest.score(X_train, y_train))
print(randomforest.score(X_val, y_val))
1.0
0.9543958627174424
In [81]:
randomforest_modelscore = model_score(randomforest, X_test, y_test)
Confusion Matrix:  [[ 373  115]
 [  34 2517]]
Accuracy Score:  0.9509707140506746
Precision Score:  0.764344262295082
Recall Score:  0.9164619164619164
F1 Score:  0.8335195530726257

Oversampled Dataset¶

In [209]:
df_over = df.copy()
In [210]:
X = df_over.drop(columns=['Attrition_Flag', 'Education_Level', 'Marital_Status'], axis=1)
y = df_over['Attrition_Flag']
In [211]:
X = pd.get_dummies(X, drop_first=True)
In [212]:
X.replace(True, 1, inplace = True)
X.replace(False, 0, inplace = True)
In [213]:
y.replace('Attrited Customer', 1, inplace = True)
y.replace('Existing Customer', 0, inplace = True)
In [214]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 23 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Customer_Age                    10127 non-null  int64  
 1   Dependent_count                 10127 non-null  int64  
 2   Months_on_book                  10127 non-null  int64  
 3   Total_Relationship_Count        10127 non-null  int64  
 4   Months_Inactive_12_mon          10127 non-null  int64  
 5   Contacts_Count_12_mon           10127 non-null  int64  
 6   Credit_Limit                    10127 non-null  float64
 7   Total_Revolving_Bal             10127 non-null  int64  
 8   Avg_Open_To_Buy                 10127 non-null  float64
 9   Total_Amt_Chng_Q4_Q1            10127 non-null  float64
 10  Total_Trans_Amt                 10127 non-null  int64  
 11  Total_Trans_Ct                  10127 non-null  int64  
 12  Total_Ct_Chng_Q4_Q1             10127 non-null  float64
 13  Avg_Utilization_Ratio           10127 non-null  float64
 14  Gender_M                        10127 non-null  int64  
 15  Income_Category_$40K - $60K     10127 non-null  int64  
 16  Income_Category_$60K - $80K     10127 non-null  int64  
 17  Income_Category_$80K - $120K    10127 non-null  int64  
 18  Income_Category_Less than $40K  10127 non-null  int64  
 19  Income_Category_abc             10127 non-null  int64  
 20  Card_Category_Gold              10127 non-null  int64  
 21  Card_Category_Platinum          10127 non-null  int64  
 22  Card_Category_Silver            10127 non-null  int64  
dtypes: float64(5), int64(18)
memory usage: 1.8 MB
In [215]:
# Splitting data into training, validation and test set:
# first we split data into 2 parts, temporary and test

X_temp, X_testover, y_temp, y_testover = train_test_split( X, y, test_size=0.30, random_state=40, stratify=y )

# then we split the temporary set into train and validation

X_train, X_valover, y_train, y_valover = train_test_split( X_temp, y_temp, test_size=0.30, random_state=40, stratify=y_temp )
In [216]:
# To oversample data
from imblearn.over_sampling import SMOTE

print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

sm = SMOTE( sampling_strategy="minority", k_neighbors=10, random_state=40 )

X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))

print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 797
Before UpSampling, counts of label 'No': 4164 

After UpSampling, counts of label 'Yes': 4164
After UpSampling, counts of label 'No': 4164 

After UpSampling, the shape of train_X: (8328, 23)
After UpSampling, the shape of train_y: (8328,) 

1. Decision Tree Model

In [219]:
dtree = DecisionTreeClassifier(random_state=40)
dtree.fit(X_train_over, y_train_over)
Out[219]:
DecisionTreeClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=40)
In [220]:
print(dtree.score(X_train_over, y_train_over))
print(dtree.score(X_valover, y_valover))
1.0
0.9280677009873061
In [224]:
dtreeover_modelscore = model_score(dtree, X_testover, y_testover)
Confusion Matrix:  [[2393  158]
 [  75  413]]
Accuracy Score:  0.9233300427772293
Precision Score:  0.9380635045080361
Recall Score:  0.9696110210696921
F1 Score:  0.9535764096433553

2. Bagging

In [225]:
bagging = BaggingClassifier(random_state=40)
bagging.fit(X_train_over, y_train_over)
Out[225]:
BaggingClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BaggingClassifier(random_state=40)
In [226]:
print(bagging.score(X_train_over, y_train_over))
print(bagging.score(X_valover, y_valover))
0.9983189241114313
0.9431123648330982
In [227]:
baggingover_modelscore = model_score(bagging, X_testover, y_testover)
Confusion Matrix:  [[2449  102]
 [  73  415]]
Accuracy Score:  0.9424152681803225
Precision Score:  0.960015680125441
Recall Score:  0.9710547184773989
F1 Score:  0.9655036467573428

3. Boosting

3.1 Ada Boost

In [228]:
adaboost = AdaBoostClassifier(random_state=40)
adaboost.fit(X_train_over, y_train_over)
Out[228]:
AdaBoostClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(random_state=40)
In [229]:
print(adaboost.score(X_train_over, y_train_over))
print(adaboost.score(X_valover, y_valover))
0.9612151777137368
0.9402914903620122
In [230]:
adaboostover_modelscore = model_score(adaboost, X_testover, y_testover)
Confusion Matrix:  [[2407  144]
 [  64  424]]
Accuracy Score:  0.9315564330371833
Precision Score:  0.9435515484123873
Recall Score:  0.9740995548360988
F1 Score:  0.9585822381521307

3.2 Gradient Boost

In [231]:
gradientboost = GradientBoostingClassifier(random_state=40)
gradientboost.fit(X_train_over, y_train_over)
Out[231]:
GradientBoostingClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(random_state=40)
In [232]:
print(gradientboost.score(X_train_over, y_train_over))
print(gradientboost.score(X_valover, y_valover))
0.9780259365994236
0.9543958627174424
In [234]:
gradientboostover_modelscore = model_score(gradientboost, X_testover, y_testover)
Confusion Matrix:  [[2439  112]
 [  56  432]]
Accuracy Score:  0.9447186574531096
Precision Score:  0.9560956487651902
Recall Score:  0.9775551102204408
F1 Score:  0.9667063020214032

4. RandomForest

In [235]:
randomforest = RandomForestClassifier(random_state=40)
randomforest.fit(X_train_over, y_train_over)
Out[235]:
RandomForestClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(random_state=40)
In [236]:
print(randomforest.score(X_train_over, y_train_over))
print(randomforest.score(X_valover, y_valover))
1.0
0.9567465914433474
In [237]:
randomforestover_modelscore = model_score(randomforest, X_testover, y_testover)
Confusion Matrix:  [[2473   78]
 [  83  405]]
Accuracy Score:  0.9470220467258966
Precision Score:  0.9694237553900431
Recall Score:  0.9675273865414711
F1 Score:  0.9684746426473467

Undersampled Dataset¶

In [239]:
# To undersample data
from imblearn.under_sampling import RandomUnderSampler
In [240]:
rus = RandomUnderSampler(random_state=40)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
In [243]:
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_under == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_under == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_under.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_under.shape))
Before Under Sampling, counts of label 'Yes': 797
Before Under Sampling, counts of label 'No': 4164 

After Under Sampling, counts of label 'Yes': 797
After Under Sampling, counts of label 'No': 797 

After Under Sampling, the shape of train_X: (1594, 23)
After Under Sampling, the shape of train_y: (1594,) 

1. Decision Tree Model

In [244]:
dtree = DecisionTreeClassifier(random_state=40)
dtree.fit(X_train_under, y_train_under)
Out[244]:
DecisionTreeClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=40)
In [245]:
print(dtree.score(X_train_under, y_train_under))
print(dtree.score(X_valover, y_valover))
1.0
0.8951574988246357
In [246]:
dtreeunder_modelscore = model_score(dtree, X_testover, y_testover)
Confusion Matrix:  [[2270  281]
 [  61  427]]
Accuracy Score:  0.8874629812438302
Precision Score:  0.8898471187769502
Recall Score:  0.9738309738309738
F1 Score:  0.9299467431380581

2. Bagging

In [247]:
bagging = BaggingClassifier(random_state=40)
bagging.fit(X_train_under, y_train_under)
Out[247]:
BaggingClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BaggingClassifier(random_state=40)
In [248]:
print(bagging.score(X_train_under, y_train_under))
print(bagging.score(X_valover, y_valover))
0.993099121706399
0.9355900329102022
In [249]:
baggingunder_modelscore = model_score(bagging, X_testover, y_testover)
Confusion Matrix:  [[2366  185]
 [  51  437]]
Accuracy Score:  0.9223428759460349
Precision Score:  0.9274794198353586
Recall Score:  0.9788994621431527
F1 Score:  0.9524959742351048

3. Boosting

3.1 Ada Boost

In [250]:
adaboost = AdaBoostClassifier(random_state=40)
adaboost.fit(X_train_under, y_train_under)
Out[250]:
AdaBoostClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(random_state=40)
In [251]:
print(adaboost.score(X_train_under, y_train_under))
print(adaboost.score(X_valover, y_valover))
0.9554579673776662
0.9337094499294781
In [252]:
adaboostunder_modelscore = model_score(adaboost, X_testover, y_testover)
Confusion Matrix:  [[2351  200]
 [  36  452]]
Accuracy Score:  0.9223428759460349
Precision Score:  0.9215993727949824
Recall Score:  0.9849183074989527
F1 Score:  0.9522073714054273

3.2 Gradient Boost

In [253]:
gradientboost = GradientBoostingClassifier(random_state=40)
gradientboost.fit(X_train_under, y_train_under)
Out[253]:
GradientBoostingClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(random_state=40)
In [254]:
print(gradientboost.score(X_train_under, y_train_under))
print(gradientboost.score(X_valover, y_valover))
0.9849435382685069
0.9431123648330982
In [255]:
gradientboostunder_modelscore = model_score(gradientboost, X_testover, y_testover)
Confusion Matrix:  [[2392  159]
 [  32  456]]
Accuracy Score:  0.937150378413952
Precision Score:  0.9376715013720109
Recall Score:  0.9867986798679867
F1 Score:  0.9616080402010051

4. RandomForest

In [256]:
randomforest = RandomForestClassifier(random_state=40)
randomforest.fit(X_train_under, y_train_under)
Out[256]:
RandomForestClassifier(random_state=40)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(random_state=40)
In [257]:
print(randomforest.score(X_train_under, y_train_under))
print(randomforest.score(X_valover, y_valover))
1.0
0.9355900329102022
In [258]:
randomforestover_modelscore = model_score(randomforest, X_testover, y_testover)
Confusion Matrix:  [[2366  185]
 [  42  446]]
Accuracy Score:  0.9253043764396183
Precision Score:  0.9274794198353586
Recall Score:  0.9825581395348837
F1 Score:  0.9542246420649324

Model Comparision¶

In [269]:
# Create a list of model names and scores
model_names = ['Decision Tree', 'Bagging', 'AdaBoost', 'Gradient Boost', 'Random Forest',
               'Decision Tree (Oversampled)', 'Bagging (Oversampled)', 'AdaBoost (Oversampled)',
               'Gradient Boost (Oversampled)', 'Random Forest (Oversampled)',
               'Decision Tree (Undersampled)', 'Bagging (Undersampled)', 'AdaBoost (Undersampled)',
               'Gradient Boost (Undersampled)', 'Random Forest (Undersampled)']

# Create a list of dictionaries, where each dictionary represents a model's scores
model_scores = [
    {'Accuracy': 0.934, 'Precision': 0.962, 'Recall': 0.958, 'F1-Score': 0.960},  # Decision Tree
    {'Accuracy': 0.962, 'Precision': 0.983, 'Recall': 0.973, 'F1-Score': 0.978},  # Bagging
    {'Accuracy': 0.957, 'Precision': 0.972, 'Recall': 0.970, 'F1-Score': 0.971},  # AdaBoost
    {'Accuracy': 0.961, 'Precision': 0.979, 'Recall': 0.974, 'F1-Score': 0.976},  # Gradient Boost
    {'Accuracy': 0.964, 'Precision': 0.987, 'Recall': 0.972, 'F1-Score': 0.980},  # Random Forest
    {'Accuracy': 0.911, 'Precision': 0.917, 'Recall': 0.975, 'F1-Score': 0.945},  # Decision Tree (Oversampled)
    {'Accuracy': 0.956, 'Precision': 0.965, 'Recall': 0.981, 'F1-Score': 0.973},  # Bagging (Oversampled)
    {'Accuracy': 0.934, 'Precision': 0.934, 'Recall': 0.987, 'F1-Score': 0.960},  # AdaBoost (Oversampled)
    {'Accuracy': 0.950, 'Precision': 0.954, 'Recall': 0.985, 'F1-Score': 0.969},  # Gradient Boost (Oversampled)
    {'Accuracy': 0.960, 'Precision': 0.967, 'Recall': 0.984, 'F1-Score': 0.975},  # Random Forest (Oversampled)
    {'Accuracy': 0.825, 'Precision': 0.882, 'Recall': 0.887, 'F1-Score': 0.885},  # Decision Tree (Undersampled)
    {'Accuracy': 0.864, 'Precision': 0.903, 'Recall': 0.918, 'F1-Score': 0.910},  # Bagging (Undersampled)
    {'Accuracy': 0.892, 'Precision': 0.915, 'Recall': 0.943, 'F1-Score': 0.929},  # AdaBoost (Undersampled)
    {'Accuracy': 0.896, 'Precision': 0.916, 'Recall': 0.945, 'F1-Score': 0.930},  # Gradient Boost (Undersampled)
    {'Accuracy': 0.898, 'Precision': 0.919, 'Recall': 0.944, 'F1-Score': 0.931}  # Random Forest
    ]


# Create a DataFrame from the list of dictionaries
df_model_scores = pd.DataFrame(model_scores, index=model_names)

# Display the DataFrame
df_model_scores

# Sorting models in decreasing order of test recall
df_model_scores.sort_values(
    by=["Accuracy", "Recall"], ascending=False
).style.highlight_max(color="lightgreen", axis=0).highlight_min(color="pink", axis=0)
Out[269]:
  Accuracy Precision Recall F1-Score
Random Forest 0.964000 0.987000 0.972000 0.980000
Bagging 0.962000 0.983000 0.973000 0.978000
Gradient Boost 0.961000 0.979000 0.974000 0.976000
Random Forest (Oversampled) 0.960000 0.967000 0.984000 0.975000
AdaBoost 0.957000 0.972000 0.970000 0.971000
Bagging (Oversampled) 0.956000 0.965000 0.981000 0.973000
Gradient Boost (Oversampled) 0.950000 0.954000 0.985000 0.969000
AdaBoost (Oversampled) 0.934000 0.934000 0.987000 0.960000
Decision Tree 0.934000 0.962000 0.958000 0.960000
Decision Tree (Oversampled) 0.911000 0.917000 0.975000 0.945000
Random Forest (Undersampled) 0.898000 0.919000 0.944000 0.931000
Gradient Boost (Undersampled) 0.896000 0.916000 0.945000 0.930000
AdaBoost (Undersampled) 0.892000 0.915000 0.943000 0.929000
Bagging (Undersampled) 0.864000 0.903000 0.918000 0.910000
Decision Tree (Undersampled) 0.825000 0.882000 0.887000 0.885000

Observations on Model Comparision¶

Original Dataset

  • Decision Tree: Achieved an accuracy of 93.4%. Showed good performance, but other models outperformed it.
  • Bagging: Achieved an accuracy of 96.2%. Showed strong performance and high recall, indicating a good ability to identify churned customers.
  • AdaBoost: Achieved an accuracy of 95.7%. Performed well, with a high precision and recall, effectively identifying churned customers and minimizing false positives.
  • Gradient Boost: Achieved an accuracy of 96.1%. Performed very well, similar to Bagging, with high precision and recall.
  • Random Forest: Achieved an accuracy of 96.4%. Showed the highest accuracy among all models, indicating excellent overall performance.

Oversampled Dataset

  • Models trained on the oversampled dataset generally showed a slight decrease in accuracy compared to the original dataset, but an increase in recall. This indicates better performance in identifying churned customers, but at the cost of increased false positives.
  • Random Forest and Bagging still showed the best performance on the oversampled dataset.

Undersampled Dataset

  • Models trained on the undersampled dataset showed lower accuracy and precision compared to the original and oversampled datasets. This indicates a trade-off between overall performance and the ability to identify churned customers.
  • Undersampling might not be the best approach for this dataset, as it leads to a significant loss of information.

  • The best 5 models are:
  1. Random Forest trained with original data
  2. Bagging trained with original data
  3. Gradient Boost trained with original data
  4. Random Forest trained with oversampled data
  5. Ada Boost trained with original data

We will tune the top 3 models using Random Search CV


Hyperparameter Tunning Of Best 3 Models¶

1. Random Forest

In [279]:
from sklearn.model_selection import RandomizedSearchCV

# we are tuning some hyperparameters right now, we are passing the different values for both parameters

grid_param = {
    "n_estimators": np.arange(10, 40, 10),
    "min_samples_leaf": np.arange(5, 10),
    "min_samples_split": [3, 5, 7],
    "max_features": ["sqrt", "log2"],
    "max_samples": np.arange(0.3, 0.7, 0.1),
}

randomforest_tunned = RandomizedSearchCV(estimator = randomforest,
                     param_distributions = grid_param,
                     cv = 5,
                    n_jobs = -1)

randomforest_tunned.fit(X_train, y_train)
Out[279]:
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=40),
                   n_jobs=-1,
                   param_distributions={'max_features': ['sqrt', 'log2'],
                                        'max_samples': array([0.3, 0.4, 0.5, 0.6]),
                                        'min_samples_leaf': array([5, 6, 7, 8, 9]),
                                        'min_samples_split': [3, 5, 7],
                                        'n_estimators': array([10, 20, 30])})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=40),
                   n_jobs=-1,
                   param_distributions={'max_features': ['sqrt', 'log2'],
                                        'max_samples': array([0.3, 0.4, 0.5, 0.6]),
                                        'min_samples_leaf': array([5, 6, 7, 8, 9]),
                                        'min_samples_split': [3, 5, 7],
                                        'n_estimators': array([10, 20, 30])})
RandomForestClassifier(max_features='log2', max_samples=0.6000000000000001,
                       min_samples_leaf=7, min_samples_split=5, n_estimators=30,
                       random_state=40)
RandomForestClassifier(max_features='log2', max_samples=0.6000000000000001,
                       min_samples_leaf=7, min_samples_split=5, n_estimators=30,
                       random_state=40)
In [280]:
best_parameter = randomforest_tunned.best_params_
print(best_parameter)
{'n_estimators': 30, 'min_samples_split': 5, 'min_samples_leaf': 7, 'max_samples': 0.6000000000000001, 'max_features': 'log2'}
In [281]:
randomforest_tunned.best_score_
Out[281]:
0.9367073547087678
In [286]:
rfcl_tuned = RandomForestClassifier(n_estimators=30, min_samples_split=5, min_samples_leaf=7, max_samples=0.6000000000000001, max_features='log2')
rfcl_tuned.fit(X_train, y_train)
Out[286]:
RandomForestClassifier(max_features='log2', max_samples=0.6000000000000001,
                       min_samples_leaf=7, min_samples_split=5,
                       n_estimators=30)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_features='log2', max_samples=0.6000000000000001,
                       min_samples_leaf=7, min_samples_split=5,
                       n_estimators=30)

2. Bagging

In [288]:
# we are tuning some hyperparameters right now, we are passing the different values for both parameters

grid_param = {
    "n_estimators": np.arange(10, 40, 10),
    "max_features": np.arange(0, 15),
    "max_samples": np.arange(0.3, 0.7, 0.1),
    "bootstrap": [True, False],
}

bagging_tunned = RandomizedSearchCV(estimator = bagging,
                     param_distributions = grid_param,
                     cv = 5,
                    n_jobs = -1)

bagging_tunned.fit(X_train, y_train)
Out[288]:
RandomizedSearchCV(cv=5, estimator=BaggingClassifier(random_state=40),
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_features': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
                                        'max_samples': array([0.3, 0.4, 0.5, 0.6]),
                                        'n_estimators': array([10, 20, 30])})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(cv=5, estimator=BaggingClassifier(random_state=40),
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_features': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14]),
                                        'max_samples': array([0.3, 0.4, 0.5, 0.6]),
                                        'n_estimators': array([10, 20, 30])})
BaggingClassifier(bootstrap=False, max_features=14, max_samples=0.5,
                  n_estimators=20, random_state=40)
BaggingClassifier(bootstrap=False, max_features=14, max_samples=0.5,
                  n_estimators=20, random_state=40)
In [289]:
best_parameter = bagging_tunned.best_params_
print(best_parameter)
{'n_estimators': 20, 'max_samples': 0.5, 'max_features': 14, 'bootstrap': False}
In [290]:
bagging_tunned.best_score_
Out[290]:
0.9526315255173309

3. Gradient Boost

In [291]:
# we are tuning some hyperparameters right now, we are passing the different values for both parameters
GradientBoostingClassifier()
grid_param = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
    }

gb_tunned = RandomizedSearchCV(estimator = gradientboost,
                     param_distributions = grid_param,
                     cv = 5,
                    n_jobs = -1)

gb_tunned.fit(X_train, y_train)
Out[291]:
RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(random_state=40),
                   n_jobs=-1,
                   param_distributions={'learning_rate': [0.01, 0.1, 0.2],
                                        'max_depth': [3, 5, 7],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [100, 200, 300]})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(cv=5, estimator=GradientBoostingClassifier(random_state=40),
                   n_jobs=-1,
                   param_distributions={'learning_rate': [0.01, 0.1, 0.2],
                                        'max_depth': [3, 5, 7],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [100, 200, 300]})
GradientBoostingClassifier(learning_rate=0.2, min_samples_leaf=2,
                           min_samples_split=10, n_estimators=200,
                           random_state=40)
GradientBoostingClassifier(learning_rate=0.2, min_samples_leaf=2,
                           min_samples_split=10, n_estimators=200,
                           random_state=40)
In [292]:
best_parameter = gb_tunned.best_params_
print(best_parameter)
{'n_estimators': 200, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': 3, 'learning_rate': 0.2}
In [293]:
gb_tunned.best_score_
Out[293]:
0.9723855293506155

Model Performance Comparison and Final Model Selection¶

In [300]:
# Create a list of model names and scores
model_names = ['Random Forest', 'Bagging', 'Gradient Boost']

# Create a list of dictionaries, where each dictionary represents a model's scores
model_scores = [
    {'Random Forest' : randomforest_tunned.best_score_},
    {'Bagging' : bagging_tunned.best_score_},
    {'Gradient Boost' : gb_tunned.best_score_}
]

# Create a DataFrame from the list of dictionaries
df_best_model_scores = pd.DataFrame(model_scores, index=model_names)

# Display the DataFrame
df_best_model_scores
Out[300]:
Random Forest Bagging Gradient Boost
Random Forest 0.936707 NaN NaN
Bagging NaN 0.952632 NaN
Gradient Boost NaN NaN 0.972386

Observation on Model Performance Comparison and Final Model Selection¶

  • We have evaluated the top 3 models which are Random Forest, Bagging ans Gradient Boost. On the basis of model best score we found Gradient Boost is well performing model.
  • On the basis of hyperparameter tunning we have found the final model which may be more tunned in future.
  • Among these 3 model Random Forest also will be do well but here we are not prefer it because of its computational cost.

Actionable Insights & Recommendations¶

Insights¶

  • Customer Churn Drivers: Customers with lower total transaction amounts, lower credit utilization ratios, and higher changes in transaction behavior between quarters are more likely to churn. These factors can indicate decreased engagement and satisfaction with the credit card services.
  • High-Risk Customer Segments: Graduate customers and females are more prone to churn. These segments require focused attention and targeted retention strategies.
  • Product Holding and Relationship: Customers with fewer products and a shorter relationship with the bank are also more likely to churn. Promoting product bundling and loyalty programs could improve customer retention.
  • Card Category: Blue card users have the most churn in our analysis. Focus on customer satisfaction of Blue card users to reduce the churn rate of this segment.
  • Income Category: Churn is most apparent for income categories less than \$40K and \$60K - \$80K. Consider reviewing product offerings and features for these segments to increase attractiveness and engagement.

Recommendations¶

  • Targeted Retention Strategies: Develop targeted retention campaigns for high-risk customer segments (graduates, females, lower-income customers, Blue card users) with tailored offers and incentives to discourage churn.
  • Product Bundling and Loyalty Programs: Promote product bundling and loyalty programs to increase customer engagement and strengthen the overall relationship with the bank.
  • Customer Engagement: Engage inactive customers with proactive communications and personalized offers to encourage greater usage and satisfaction. Focus on customers with lower total transaction amounts and utilization ratios.
  • Enhanced Customer Service: Address customer concerns and complaints promptly and effectively to improve customer satisfaction and prevent churn due to service-related issues. This is particularly relevant for the identified high-risk customer segments and card categories.
  • Product Review and Optimization: Regularly review and optimize product offerings, features, and pricing for different customer segments to ensure they meet customer needs and expectations. Focus on the Blue card category and potentially explore new strategies for this offering to reduce churn.
  • Personalized Marketing: Utilize customer data and analytics to develop personalized marketing messages and promotional offers that resonate with individual customer needs and preferences.
  • Real-time Monitoring: Implement a system for real-time monitoring of customer behavior and engagement to identify potential churn risks early and take proactive measures.

By addressing the identified insights and implementing the recommendations, Thera Bank can potentially reduce customer churn and enhance its overall customer retention strategy.